We will begin by loading the imm_face_db dataset and processing the images and annotations. We will convert the images to grayscale with normalized float values, and additionally split the data into a training set and a validation set as described in the spec.
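The annotation-parsing step can be sketched as follows. This assumes the IMM `.asf` files use the documented layout, where each landmark line has seven whitespace-separated fields with the relative x and y coordinates in the third and fourth positions; the field indices here are an assumption, not taken from my actual loader.

```python
import numpy as np

def parse_asf(asf_text):
    """Parse an IMM .asf annotation into an (n_points, 2) array of
    relative (x, y) coordinates in [0, 1].

    Assumption: landmark lines have exactly 7 whitespace-separated
    fields, with x_rel and y_rel in positions 3 and 4.
    """
    pts = []
    for line in asf_text.splitlines():
        line = line.strip()
        # skip comments, the point-count line, and the image-name line
        if not line or line.startswith('#'):
            continue
        fields = line.split()
        if len(fields) == 7:
            pts.append((float(fields[2]), float(fields[3])))
    return np.array(pts, dtype=np.float32)
```

Because the coordinates are stored relative to the image dimensions, they can be scaled to any resized input without re-annotating.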
From here, we are going to use our loaded data to train a CNN to detect the nose tips of the individuals. To begin with, we will create the following network structure:
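A minimal PyTorch sketch of this kind of nose-tip regressor is below. The channel widths, layer count, and the 60×80 input size are assumptions for illustration, not the exact structure from the figure; the network regresses a single (x, y) coordinate.

```python
import torch
import torch.nn as nn

class NoseNet(nn.Module):
    """Small CNN that regresses the (x, y) nose-tip position.

    Assumptions: grayscale 1x60x80 input; channel widths and layer
    counts are illustrative, not the exact figure's architecture.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(12, 20, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 28, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # 60x80 input pooled 3 times -> 7x10 feature maps
            nn.Linear(28 * 7 * 10, 128), nn.ReLU(),
            nn.Linear(128, 2),  # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.head(self.features(x))
```

Predicting normalized coordinates (matching the relative annotations) keeps the regression target in a small, consistent range.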
Now, after training this network on the training data for 25 epochs, we plot the training and validation loss per epoch.
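The training loop that produces these curves can be sketched like this. The MSE loss on coordinates matches the regression setup above; the Adam optimizer and its settings are assumptions rather than a record of my exact configuration.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=25, lr=1e-3):
    """Train with MSE loss, recording per-epoch train/val loss.

    Assumption: Adam optimizer with the given lr; loaders yield
    (input, target) batches.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    history = {"train": [], "val": []}
    for _ in range(epochs):
        model.train()
        total, n = 0.0, 0
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item() * len(x)
            n += len(x)
        history["train"].append(total / n)
        model.eval()
        total, n = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                total += loss_fn(model(x), y).item() * len(x)
                n += len(x)
        history["val"].append(total / n)
    return history
```

The returned `history` dict is what gets plotted as the two loss curves.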
For reference, here are the best results from that batch, with the predictions plotted in blue and the true landmarks in red.
As well as the two worst results...
In both of these cases, it looks like the network misjudged the orientation of the face, and more specifically, which shadow or contour corresponds to the nose. It's a low-capacity network, so it's likely just underfitting.
Now we will attempt the same thing with the entire keypoint space. We will also augment the data in order to reduce overfitting.
Here are a few images from the sample; notice the varying brightness and contrast.
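The brightness/contrast jitter behind those samples can be sketched with plain NumPy. The jitter ranges here are illustrative assumptions; the important property is that photometric augmentation leaves the keypoint labels unchanged (unlike geometric augmentations, which would require transforming the landmarks too).

```python
import numpy as np

def jitter(img, rng, brightness=0.3, contrast=0.3):
    """Random brightness/contrast jitter for a float image in [0, 1].

    Assumption: jitter ranges are illustrative defaults.
    """
    b = rng.uniform(-brightness, brightness)       # additive shift
    c = rng.uniform(1 - contrast, 1 + contrast)    # scale about mid-gray
    return np.clip((img - 0.5) * c + 0.5 + b, 0.0, 1.0)
```

Applying a fresh jitter each epoch effectively multiplies the training set without storing extra images.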
And here is the complete architecture of my network.
For my hyperparameters, I mostly picked them arbitrarily. I used 6 conv layers as recommended in the spec, each followed by a ReLU, with a max-pool layer after each of the first few. A learning rate of 0.001 was sufficient to reach a good enough loss.
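A sketch of that 6-conv-layer design is below. It predicts all 58 IMM keypoints (116 outputs); the channel widths and the 120×160 input size are assumptions for illustration, not the exact values from my diagram.

```python
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    """6-conv-layer CNN regressing all facial keypoints.

    Assumptions: grayscale 1x120x160 input; channel widths are
    illustrative. Max pooling follows only the first three convs,
    matching 'a max-pool layer after each of the first few'.
    """
    def __init__(self, n_points=58):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            # 120x160 pooled 3 times -> 15x20 feature maps
            nn.Linear(64 * 15 * 20, 256), nn.ReLU(),
            nn.Linear(256, 2 * n_points),  # flattened (x, y) pairs
        )

    def forward(self, x):
        return self.head(self.features(x))
```

The flat output is reshaped to (58, 2) at evaluation time to overlay predictions on the image.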
I trained this network on the first 80% of the dataset, and here are the training and validation losses.
Here are a couple of the best predictions...
As well as the worst...
It's pretty clear from looking at some of the worst-performing results that the network struggles with turned faces. It is likely overfitting to the standard frontal orientation.